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I Abstract 

Li^ , Favaro, Lijoi, and Priinster (2012, Biometrics, 68, 1188-1196) derive a 

r^ ' novel Bayesian nonparametric estimator of the probability of detecting at 

the (n + m + l)th observation a species already observed with any given 
frequency in an enlarged sample of size n + m, conditionally on a basic sam- 
ple of size n. Unfortunately the general result under Gibbs priors (Theorem 
2), and consequently the explicit result under {a,0) Poisson-Dirichlet priors 
(Proposition 3), appear to be wrong. Here we provide the correct formu- 
las for both the results, obtained by means of a new technique devised in 
^ , Cerquetti (2013). We verify the correctness of our derivation by an explicit 

^D ■ counterproof for the two-parameter Poisson-Dirichlet case. 

cn , 

^^ i Keywords Bayesian nonparametrics; Discovery probability; Gibbs priors; Multi- 
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1 Introduction 



^ i In species sampling problems, particularly in ecological or biological studies, 

H I given a vector (ni, . . . ,nj), X]i=i ^« — "■' reporting the multiplicities of j dif- 

" ■ ■ ferent species observed in order of appearance among the first n observations, 

interest may lie in estimating the probability to detect, at step n + m + 1, a species 
already observed with any given frequency both among the species belonging to 
the basic n— sample, or among the new species eventually arising in the additional 
?7i— sample which is still to be observed. In Favaro, Lijoi, and Priinster (2012) 
a Bayesian nonparametric solution is provided as a generalization to the prob- 
lem of estimating the discovery probability, i.e. the probability to discover a new 
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species not represented in the previous n+m observations, already solved in Lijoi, 
Mena, and Priinster (2007). See the original article for a discussion and a review 
of previous proposals in both frequentist and Bayesian approaches. As a common 
prior assumption the authors assume that the theoretically infinite sequence of 
unknown relative abundances of the species in the population, {Pi)i>i, follows 
a random discrete distribution belonging to the Gibbs family, (Gnedin and Pit- 
man, 2006) i.e. such that, by Kingman's correspondence (Kingman, 1978), the 
probability to observe a specific partition of the first n observations among the 
first j species with multiplicities (ni, . . . ,nj) (called the exchangeable partition 
probability function, EPPF) is given by 



p{ni,...,nj) = ^ E 



U^U' 



K,,n(l-a)„,_i, (1) 



i=l 



where {li, . . . ,lj) ranges over all ordered j-tuples of distinct positive integers, and 
{x)y = x{x + 1) ■ ■ ■ {x + y — 1) stands for rising factorials. Here the V = iVn,]) 
represent specific Gibbs coefficients identifying a specific Gibbs model belonging 
to one of the three classes devised by Gnedin and Pitman (2006), and arise as 
mixtures of three different extreme partitions coefficients for a € (— oo, 0), a = 
and a € (0,1). This is equivalent to assume that the theoretically infinite se- 
quence of species labels (Xi)j>i is exchangeable with almost surely discrete de 
Finetti measure representable as P(-) = Yl'^i^i^Yii'), for (Pi) any rearrange- 
ment of the ranked frequencies (-P/) satisfying ([1]), independent of (1^) -^ IID 
H{-), for H some non atomic probability distribution. 

While testing, through an example, the computational advantage provided 
by the technique based on marginals of multivariate Gibbs distributions devised 
in Cerquetti (2013), in deriving Bayesian nonparametric estimators in species 
sampling problems under Gibbs priors, we find the result in Theorem 2. under 
general Gibbs priors, and consequently the particular result in Proposition 3. 
under two-parameter Poisson-Dirichlet priors, (Pitman and Yor, 1997) in Favaro 
et al. (2012) to be affected by a mistake. Despite it is just a subtle problem 
in a summation limit, it actually produces concrete consequences on calculation 
of the estimator, as from the (a, 0) particular case. The aim of this note is to 
provide potential readers of the Biometrics paper with the correct formulas, in 
view of possible applications of the Favaro et al. (2012) results under different 
classes of Gibbs priors and different datasets, or for further derivations of related 
theoretical results |j 

We start by noticing that the complexity of the proof of Theorem 1. ob- 
tained in Favaro et al. (2012, cfr. Web Appendix pag. 2), and concerning the 
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Bayesian nonparametric estimator of the [0 : k]-discovery, namely the probabil- 
ity to observe at step n + 1 a species already represented k times among the 
first n observations, may be avoided by encoding the EPPF of a general Gibbs 
partition, p{ni, . . . ,nj) = Vnj Y\i=i{^ — 0)^-1, in terms of the counting vector 
of blocks/species of different sizes, namely 



q{ci,...,Cn) = Vn,j JJ[(1 -a)k-i] 



k=l 



for Ck = Yli=i^{^i ~ ^}' ^ ~ l,...,n. The [0 : k]-discovery would therefore 
simply follow as the conditional probability: 

fr .,S g(Cl,...,Cfc - l,Cfc+l +l,...,C„+l) 



q{ci,... ,Ck,... ,Cn) 

In what follows, as in Lijoi, Priinster and Walker (2008) and Cerquetti (2013), 
given the allocation of the first n observations in j species with multiplici- 
ties (ni, . . . ,nj), let Km be the random number of new species arising in the 
additional m sample, (Si, . . . , 5 {„)) the random vector of the multiplicities 
of the new species in exchangeable random order (Pitman, 2006, Eq. 2.7), 

Sm = Y^i=i 'S'i the total number of new observations belonging to new species 
and {Mi^m, ■ ■ ■ iMj,m) the random vector of the allocation of the additional ob- 
servations in the j old species for ^11=1 ^i,m = rn — Sm- 

2 Correct Estimation of the ?Ti-Step /c-Discovery 

2.1 Under General Gibbs priors 

From ([2]) notice that, given a basic sample (ni, . . . , n^), but assuming an interme- 
diate ?7i-sample still to be observed, the probability to observe at step n + m-\-l 
a species represented k times among the new species generated by the additional 
sample, can be expressed as the random quantity 

KLTic., V) = H.W !«±I!^±M±^(^ _ «), (3) 

In) 

for Km the random number of new species induced by the additional sample and 

^km ~ X]j=i ^{Si = k\Kn = j} the random number of new species represented 
k times. 

The next result has been established in Cerquetti (2013), cf. Propositions 8, 
by means of a far more easy proof of the one proposed in Favaro et al. (2012). For 



the sake of clarity we recall that non central generalized Stirling numbers and non 
central generalized factorial coefficients used in Favaro et al. (2012) are tied by 
the relation S~c'~°''~ = C{n, ^; a, —(3)/a^. We even recall the compact notation 
for generalized rising factorials {x)n'[y = {x){x + y){x + 2y) • • • (x + (n — l)y). 
Notice also that in formula (JH) may be useful to write the summation limits in the 
unconventional way X^^7=0' ^^ ^^ ^^ ubiquitous in Cerquetti (2013), to highlight 
that the correspondence with the parameters of the generalized Stirling numbers 
must be satisfied for the summation to make sense, and to exploit equation ([5]) 
when needed. 



Theorem 1. (Correction to Theorem 2, FLP 2012) Under a general (a, V) Gibbs 
prior, for Wj^J^ = Yli^ ^{Si = k\Kn = j}, the Bayesian nonparametric esti- 
mator o/P™^"fc+^(a, y) is given by 



m—k+l 



U:iUk) = {l-a),r) Y. %±^^™'4t^^""^"^' (4) 



k ^ Vr 

1=1 



-1,— a,— 7 



where S^^ are non central generalized Stirling numbers defined as the con- 

nection coefficients (cfr. e.g. Hsu and Shiue, 1998) 

n 

[^ -i)n = Y^ '5^i'~"'^(^)ct« (5) 

for (x)^-fa — ^(^ + a) • • • (a^ + (■? — l)a) generalized rising factorials. 

Proof. See Cerquetti (2013), proof of Proposition 8. 

Remark 2. The problem in Theorem 2 in Favaro et al. (2012) is in the sum- 
mation limits. The marginalization over the possible number / of new species 
arising in the additional ?Ti— sample, giving rise to at least one species represented 
k times, must range, as in formula (JH), between 1 (for m = k) and m — k-\-l (for 
m > k), since we can actually observe one species of size k and m — k species of 
size 1. In Favaro et al. (2012) the second term of formula (12), written explicitly, 
corresponds instead to 

y ■. m~k y 

^^~"^''\ k) ^ "+"^+^J+'^^_^ (j(^ _k,l-l;a, -{n - ja)) 

where C(n,^;a, — /?) stands for generalized factorial coefficients (Chalarambides, 
2005), showing the summation ranges until m — k, thus producing the wrong 
result. 



Remark 3. For Cj the number of species observed i times in the basic n— sample, 

f • — 



ot'L = Ei=i IK + Af^,„^ = k\ni,...,nj} and P(Mi,^ = mi,...,Mj, 



TTT'j,Sm = s\ni, . . . ,nj) Multivariate Polya-Gibbs distributions (cfr. Cerquetti, 
2013, Sect. 4) arising by the conditional allocation of new observations among 
old species, details of a far more easy derivation of the one proposed in Favaro et 
al. (2012), for the Bayesian nonparametric estimator, under general Gibbs prior, 
of the probability to observe at step n + m + 1 a species represented k times 
among the old species, can be found in Cerquetti (2013, cf. Proposition 9). Here 
we report the result. 

k ^ \ m—k+i T^ 

(6) 
thus confirming the first term in Equation (12) in Favaro et al. (2012) is correct. 



2.2 Under Two- Parameter Poisson-Dirichlet Priors 

The problem in the general result under Gibbs priors affects the correctness of 
the result in Proposition 3, where the estimator under two-parameter Poisson 
Dirichlet priors is derived. The following correction is needed. Even in this case 
we are able to provide a considerably more easy proof with respect to the one in 
Favaro et al. (2012, cfr. Web Appendix pag. 8). 

Proposition 4. (Correction to Proposition 3. FLP 2012) Specializing equation 
^ and 1^ for the two-parameter Poisson-Dirichlet (a, 9) model yields 

k 

fjafi (i,-^_sr ( '^ \ {6 + n-i + a)m-k+i{i - a)k+i-i , 
^n+™(/c)-Z^c.^^_^J (^ + n)^+i + 

{9 + ja) /m\ (1 -a)k{0 + a + n)m-k ,„>. 

Proof. By the specific form of the (V^,j) coefficients for the two-parameter {a,0) 
model, for a E (0, 1) and 9 > —a, 

yia,e) ^ jO + «)j-it« 

"'^' {9 + l)n-l ' 

and the multiplicative property of generalized rising factorials, {x)a+b^c = {x)a-\c{x+ 
ac)b^c, equation (JH) yields 



fAa,e),new {I - a)k fm\ "s^ fn, ■ \ Q- 



and by the same property, and the definition of non-central generalized Stirling 
numbers as connection coefficients SSh, 



{9 + ja) fm\ (1 — Q)k{9 + a + n] 



m—k 



+ n) \k {9 + n + l) 



Again, by the specific form of the (a, 6) Gibbs coefficients, equation Q yields 



fc / . ^ / \ m—k+i 






and by the definition of non-central generalized Stirling numbers as connection 
coefficients ^ 

k 

Ef m \ {e + n-i + a)m-k+i{i - a)k+i~i 
i^/'U-V (e + nU+i 



Remark 5. Relying on the proposed correction it turns out that the result in 
Favaro et al. (2012) underestimates the probability of interest under general 
Gibbs priors of an amount equal to 

/-. \ I ^^\ '^+»Ti+l,j+m— fe+l Q—l,—a, — {n—ja) 

(J- — ajfc I ^ I ^^r~ '^m~k,m-k ' 

that reduces to 

/I \ / "i \ Ki+m+l,i+m-fe+l 

(l-")fc| ' 



k V ■ 

since Sn,n~"'^^ = Sn,n~°^ = 1- Under (a, 9) Poisson-Dirichlet priors this quantity 
corresponds to 

.k){e + nuJ^ + '''^-'^'^-^ 

which is easily seen to be the term subtracted in Equation (16) in Proposition 3. 
in Favaro et al. (2012). 

2.3 A Counterproof 

To check the correctness of our correction, we verify that 

m+n 



U:UO) + E ^ntmW = 1 



k=l 

where C/^+m (0) is the Bayesian nonparametric estimator of the " discovery prob- 
ability" , the probability to discover a new species at observation Xn+m+i under 
two-parameter Poisson-Dirichlet prior, as already obtained in Favaro, Lijoi, and 
Priinster (2009, cfr. Eq. (7)) 

Tjoifi /f.x _ (^ + i")(^ + " + "')m 



+ n){9 + n+l) 



Proposition 6. Under two-parameter Poisson-Dirichlet (a, 9) priors 

m+n 



Efjo'fi /, ^ _ -I _ (^ + i«) {O + a + n), 
"^'"^ ^" {e + n) {e + n + i\ 



fc=i 



Proof. With respect to the components given by the new species, possible sizes 
range between 1 and m, hence, by the second term in ([7]), we calculate 

{0 + ja) Y^ fm\ (1 - a)k{0 + a + n)ra-k ^ 
(^e + n) f^^\k) (e + n + lU 

and by the definition of Beta-Binomial distribution of parameters {m, (1 — a), {6+ 
a + n)) (see e.g. Johnson and Kotz, 2005) 

+ i") f-, i9 + a + n)m 



+ n) \ (e + n + l), 
ja) {6 + ja) {9 + a + n 



[e + n] {e + n) {9 + n + l)ra 

It remains to check that summing the first term in d?]) over possible sizes of the 
old species yields (n — ja)/{9 + n), and in fact 

Y} — I— TTJ K* 

sr^ sr^ [ 'm \ {& + n - i + a)m-k+i{i - a)k+i^ 



k=i i=\ ^^ ~^^ *^^ "*" '^)m+l 



n m 



EY^ I m \{e + n-i + a)m-k+S-a)k+i-i 



and by the multiplicative property of rising factorials and the definition of Beta- 
Binomial distribution of parameters (tti, [i — a + 1),{0 + n — i -\- a)) 



X 


fc-j=0 


/ m 
\k — i 


yo- 


n 

->> 

1=1 

f n- 


'\e + n) 

i + a)m-k+i{'i' - 


a + l)k~i 


J 

n 
i=l 


(i- 


ie + n+l)m 

a) {n - ja) 
n) {9 + n) ■ 





2.4 Correction to the Matlab Code 

From our correction the Matlab code provided by Favaro et al. (2012) in the Web 
Appendix (cfr. folder BI0M_1793_sni_suppniat.zip, file discovery.m) to estimate 
the [m : k] discovery under (a, 6) Poisson-Dirichlet priors should read 

BNP_mk_discovery_estimate (k , i) = 
partial_sum+(exp(rising_factorial(l-sigina,k)) 
*binoniial_coef f icient (i ,k) * ( ( (theta+( j *sigma) ) 
*exp(rising_f actorial(theta+n+sigma,i-k)- 
rising_f actorial(theta+n, i+1) ) ) ) ) ; 

Estimates obtained in Table 3. in Section 3. "An Application to Genomic Data" 
should therefore be recalculated accordingly. 
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